Introduction to {infer}

  • A common question an analyst faces is whether an observed effect or difference in the data is real or simply due to chance.

  • To answer this, they define a null hypothesis, which supposes that the observed difference/effect in the data is due to randomness, with the alternative hypothesis being the opposite. A common workflow is to compute a test statistic that summarizes the difference/effect in the data. From the test statistic, a p-value is calculated, and if the p-value falls below a chosen threshold, the null hypothesis can be rejected.

  • A significant pain point is selecting the appropriate test and verifying that its assumptions hold for your data and use case. There are dozens of different tests, each with its own assumptions and nuances.

  • A computer scientist named Allen Downey coined the now-famous “There is only one test” framework, in which essentially all common hypothesis tests follow the same recipe: simulate a world in which the null hypothesis is true, compute the test statistic on each simulated data set, and see how extreme the observed statistic is relative to that simulated distribution.

  • This common workflow underpins the functions in the infer package, which is designed around the four main tasks an analyst performs when conducting common statistical hypothesis tests:

  • specify(): specify the variable(s) and relationship between them

  • hypothesize(): define the null hypothesis

  • generate(): generate data reflecting the null hypothesis

  • calculate(): calculate a distribution of statistics from the generated data to form the null distribution
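Concretely, the “one test” recipe can be sketched in a few lines of base R on toy data (the variables here are made up for illustration): permute the group labels to simulate the null hypothesis, recompute the statistic each time, and see how extreme the observed statistic is.

```r
# Toy permutation test: is the difference in group means due to chance?
set.seed(42)
group <- rep(c("a", "b"), each = 20)
value <- c(rnorm(20, mean = 0), rnorm(20, mean = 1))

# Observed test statistic: difference in group means
observed <- mean(value[group == "b"]) - mean(value[group == "a"])

# Simulate the null hypothesis by shuffling the group labels,
# which breaks any real association between group and value
null_stats <- replicate(5000, {
  shuffled <- sample(group)
  mean(value[shuffled == "b"]) - mean(value[shuffled == "a"])
})

# Two-tailed p-value: proportion of simulated statistics
# at least as extreme as the observed one
p_value <- mean(abs(null_stats) >= abs(observed))
```

This is exactly the structure that infer’s four verbs formalize.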

Common workflow for common hypothesis testing

To install the package:

install.packages("infer")

We’ll also need:

library(tidyverse)
library(infer)
library(palmerpenguins)
library(ggridges)
library(broom)
library(skimr)

Comparing differences in two groups (t-test)

Exploratory data analysis

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
penguins_sans_chin <- penguins %>% filter(species != "Chinstrap")
penguins_sans_chin %>% 
  ggplot(aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot() +
  scale_fill_brewer(palette = 3) + 
  # scale_y_continuous(breaks = seq(1, 10, 1)) +
  labs(
    title = "Body mass by species",
    x = NULL, y = "Body mass (g)") +
  theme_minimal()

penguins_sans_chin %>% 
  ggplot(aes(x = body_mass_g, fill = species)) +
  geom_histogram(binwidth = 100, color = "white") +
  scale_fill_brewer(palette = 3, guide = FALSE) + 
  labs(y = "Count", x = "Body mass (g)") +
  facet_wrap(~ species) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank())

penguins_sans_chin %>% 
  ggplot(aes(x = body_mass_g, y = fct_rev(species), fill = species)) +
  stat_density_ridges(quantile_lines = TRUE, quantiles = 2, scale = 3, color = "white") + 
  scale_fill_brewer(palette = 3, guide = FALSE) + 
  # scale_x_continuous(breaks = seq(0, 10, 2)) +
  labs(x = "Body mass (g)", y = NULL) +
  theme_minimal()

penguins_sans_chin %>% 
  group_by(species) %>% 
  summarize(avg_body_mass = mean(body_mass_g, na.rm = TRUE))
## # A tibble: 2 × 2
##   species avg_body_mass
## * <fct>           <dbl>
## 1 Adelie          3701.
## 2 Gentoo          5076.

Using infer

We will use an infer workflow to determine whether the difference in median body mass between the Adelie and Gentoo species is due to chance.

  • The first step is using specify() to indicate which variables to use as the outcome, and which ones to use as the predictor.

  • The second step is adding calculate() to compute the difference in median values between the two species:

diff <- 
  penguins_sans_chin %>% 
  specify(body_mass_g ~ species) %>% 
  calculate("diff in medians", 
            order = c("Gentoo", "Adelie")) # gives the order to subtract the median values in
diff
## Response: body_mass_g (numeric)
## Explanatory: species (factor)
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  1300

The observed difference in median body masses is +1300 grams for Gentoo relative to Adelie. How likely is it that a difference this large is due to chance alone? We test this by determining whether we would be likely to see such a difference in a simulated world where none of the variation in body mass is explained by species.

To do this, we use hypothesize() to state that body mass and species are independent (i.e., there is no true difference between the two species), then generate() to create a null distribution by permutation. Finally, we calculate() the difference in medians for each permuted data set.

null <- 
  penguins_sans_chin %>% 
  specify(body_mass_g ~ species) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 5000, type = "permute") %>% 
  calculate("diff in medians", 
            order = c("Gentoo", "Adelie"))

null %>% 
  visualize() +
  geom_vline(xintercept = diff$stat, color = "red", size = 1) +
  labs(x = "Difference in median body masses\n(Gentoo − Adelie)",
       y = "Count",
       subtitle = "Red line shows observed difference in median body masses") +
  theme_minimal() +
  theme(panel.grid.minor = element_blank())

Visually, we can see that the observed difference (red line) lies far outside the null distribution values. We can use get_p_value() to be sure:

null %>% 
  get_p_value(obs_stat = diff, direction = "both") # direction = "both" gives a two-tailed p-value
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

The p-value is so small that it displays as 0: none of the 5,000 permuted differences were as extreme as the observed one. In other words, it is extremely unlikely to see a difference in median body masses this large between Adelie and Gentoo in a simulated world where there is no true difference. We can safely infer that the difference in median body mass between the two species is significant.
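A note on reading that 0: a permutation p-value is computed from a finite number of reps, so its resolution is limited, and a reported 0 really means “smaller than 1 / reps”. A quick sketch in base R (the bias-corrected formula shown here is a common convention, not part of infer’s API):

```r
reps <- 5000

# The smallest nonzero p-value 5000 permutations can resolve
min_detectable <- 1 / reps   # 0.0002

# A common bias correction, (n_more_extreme + 1) / (reps + 1),
# guarantees the estimate is never exactly zero
p_corrected <- (0 + 1) / (reps + 1)
```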

Comparing differences in multiple groups (ANOVA)

ANOVAs are used to analyze differences in group means. As an illustrative example, we’ll look at the relationship between penguin species and bill length.

Exploratory data analysis

penguins %>% 
  ggplot(aes(x = species, y = bill_length_mm, fill = species)) +
  geom_boxplot() +
  scale_fill_brewer(palette = 5) + 
  # scale_y_continuous(breaks = seq(1, 10, 1)) +
  labs(
    title = "Bill length by species",
    x = NULL, y = "Bill length (mm)") +
  theme_minimal()

penguins %>% 
  ggplot(aes(x = bill_length_mm, fill = species)) +
  geom_histogram(binwidth = 5, color = "white") +
  scale_fill_brewer(palette = 5, guide = FALSE) + 
  labs(y = "Count", x = "Bill length (mm)") +
  facet_wrap(~ species) +
  theme_minimal() +
  theme(panel.grid.major.x = element_blank())

penguins %>% 
  ggplot(aes(x = bill_length_mm, y = fct_rev(species), fill = species)) +
  stat_density_ridges(quantile_lines = TRUE, quantiles = 2, scale = 3, color = "white") + 
  scale_fill_brewer(palette = 5, guide = FALSE) + 
  # scale_x_continuous(breaks = seq(0, 10, 2)) +
  labs(x = "Bill length (mm)", y = NULL) +
  theme_minimal()

penguins %>% 
  group_by(species) %>% 
  summarize(avg_bill_length = mean(bill_length_mm, na.rm = TRUE))
## # A tibble: 3 × 2
##   species   avg_bill_length
## * <fct>               <dbl>
## 1 Adelie               38.8
## 2 Chinstrap            48.8
## 3 Gentoo               47.5

Using infer

Using the same functions as earlier, we can compute the 𝐹 statistic.

f_stat <- 
  penguins %>%
  specify(bill_length_mm ~ species) %>%
  hypothesize(null = "independence") %>%
  calculate(stat = "F") 

f_stat
## Response: bill_length_mm (numeric)
## Explanatory: species (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
##    stat
##   <dbl>
## 1  411.

The 𝐹 statistic is large: about 411. We need to compare this observed 𝐹 statistic to a null distribution that supposes there is no relationship between species and bill length, to see how likely such a value would be in that world.

null_anova <- 
  penguins %>%
  specify(bill_length_mm ~ species) %>%
  hypothesize(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "F") 

null_anova %>% 
  visualize() 

null_anova %>% 
  visualize() +
  shade_p_value(f_stat, direction = "greater")

null_anova %>%
  get_p_value(obs_stat = f_stat,
              direction = "greater")
## # A tibble: 1 × 1
##   p_value
##     <dbl>
## 1       0

We can see from the visualization and the p-value that it is extremely unlikely we would observe this test statistic in a world where there is no relationship between species and bill length. We can safely infer that the differences in bill length among the species are most likely not due to randomness.

In conclusion

Your turn

Explore the following dataset from the General Social Survey:

gss
## # A tibble: 500 × 11
##     year   age sex    college  partyid hompop hours income class finrela  weight
##    <dbl> <dbl> <fct>  <fct>    <fct>    <dbl> <dbl> <ord>  <fct> <fct>     <dbl>
##  1  2014    36 male   degree   ind          3    50 $2500… midd… below a…  0.896
##  2  1994    34 female no degr… rep          4    31 $2000… work… below a…  1.08 
##  3  1998    24 male   degree   ind          1    40 $2500… work… below a…  0.550
##  4  1996    42 male   no degr… ind          4    40 $2500… work… above a…  1.09 
##  5  1994    31 male   degree   rep          2    40 $2500… midd… above a…  1.08 
##  6  1996    32 female no degr… rep          4    53 $2500… midd… average   1.09 
##  7  1990    48 female no degr… dem          2    32 $2500… work… below a…  1.06 
##  8  2016    36 female degree   ind          1    20 $2500… midd… above a…  0.478
##  9  2000    30 female degree   rep          5    40 $2500… midd… average   1.10 
## 10  1998    33 female no degr… dem          2    40 $1500… work… far bel…  0.550
## # … with 490 more rows

See ?gss for more information on the dataset.

After performing an exploratory data analysis, answer the following:

  1. Do Americans work the same number of hours a week regardless of whether they have a college degree or not?

  2. Is there an association between age and political party affiliation in the United States?

Linear regression with one predictor

As seen in lecture, the point of linear regression is to determine the relationship between an outcome variable and its explanatory/predictor variable(s). When there is only one explanatory/predictor variable, this is called simple linear regression.
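As a bare-bones illustration (toy data; base R’s lm(), the same engine tidymodels wraps later), simple linear regression recovers an intercept and a slope from noisy data:

```r
# Simulate y = 3 + 2x + noise, then fit a simple linear regression
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 0.5)

fit <- lm(y ~ x)
coef(fit)  # intercept near 3, slope near 2
```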

We’ll explore the Ames housing data set (De Cock 2011), which contains data on 2,930 properties in Ames, Iowa. It’s part of tidymodels. If you don’t have tidymodels installed:

install.packages("tidymodels")
data(ames, package = "modeldata")

Exploratory data analysis

glimpse(ames)
## Rows: 2,930
## Columns: 74
## $ MS_SubClass        <fct> One_Story_1946_and_Newer_All_Styles, One_Story_1946…
## $ MS_Zoning          <fct> Residential_Low_Density, Residential_High_Density, …
## $ Lot_Frontage       <dbl> 141, 80, 81, 93, 74, 78, 41, 43, 39, 60, 75, 0, 63,…
## $ Lot_Area           <int> 31770, 11622, 14267, 11160, 13830, 9978, 4920, 5005…
## $ Street             <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pav…
## $ Alley              <fct> No_Alley_Access, No_Alley_Access, No_Alley_Access, …
## $ Lot_Shape          <fct> Slightly_Irregular, Regular, Slightly_Irregular, Re…
## $ Land_Contour       <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, L…
## $ Utilities          <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, All…
## $ Lot_Config         <fct> Corner, Inside, Corner, Corner, Inside, Inside, Ins…
## $ Land_Slope         <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, G…
## $ Neighborhood       <fct> North_Ames, North_Ames, North_Ames, North_Ames, Gil…
## $ Condition_1        <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, Norm, No…
## $ Condition_2        <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Nor…
## $ Bldg_Type          <fct> OneFam, OneFam, OneFam, OneFam, OneFam, OneFam, Twn…
## $ House_Style        <fct> One_Story, One_Story, One_Story, One_Story, Two_Sto…
## $ Overall_Cond       <fct> Average, Above_Average, Above_Average, Average, Ave…
## $ Year_Built         <int> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 199…
## $ Year_Remod_Add     <int> 1960, 1961, 1958, 1968, 1998, 1998, 2001, 1992, 199…
## $ Roof_Style         <fct> Hip, Gable, Hip, Hip, Gable, Gable, Gable, Gable, G…
## $ Roof_Matl          <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompSh…
## $ Exterior_1st       <fct> BrkFace, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Exterior_2nd       <fct> Plywood, VinylSd, Wd Sdng, BrkFace, VinylSd, VinylS…
## $ Mas_Vnr_Type       <fct> Stone, None, BrkFace, None, None, BrkFace, None, No…
## $ Mas_Vnr_Area       <dbl> 112, 0, 108, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6…
## $ Exter_Cond         <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Foundation         <fct> CBlock, CBlock, CBlock, CBlock, PConc, PConc, PConc…
## $ Bsmt_Cond          <fct> Good, Typical, Typical, Typical, Typical, Typical, …
## $ Bsmt_Exposure      <fct> Gd, No, No, No, No, No, Mn, No, No, No, No, No, No,…
## $ BsmtFin_Type_1     <fct> BLQ, Rec, ALQ, ALQ, GLQ, GLQ, GLQ, ALQ, GLQ, Unf, U…
## $ BsmtFin_SF_1       <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, …
## $ BsmtFin_Type_2     <fct> Unf, LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, U…
## $ BsmtFin_SF_2       <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0…
## $ Bsmt_Unf_SF        <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994,…
## $ Total_Bsmt_SF      <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, …
## $ Heating            <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, Gas…
## $ Heating_QC         <fct> Fair, Typical, Typical, Excellent, Good, Excellent,…
## $ Central_Air        <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ Electrical         <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SB…
## $ First_Flr_SF       <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, …
## $ Second_Flr_SF      <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0,…
## $ Gr_Liv_Area        <int> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616…
## $ Bsmt_Full_Bath     <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, …
## $ Bsmt_Half_Bath     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Full_Bath          <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, …
## $ Half_Bath          <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, …
## $ Bedroom_AbvGr      <int> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, …
## $ Kitchen_AbvGr      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ TotRms_AbvGrd      <int> 7, 5, 6, 8, 6, 7, 6, 5, 5, 7, 7, 6, 7, 5, 4, 12, 8,…
## $ Functional         <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, T…
## $ Fireplaces         <int> 2, 0, 0, 2, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, …
## $ Garage_Type        <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, Att…
## $ Garage_Finish      <fct> Fin, Unf, Unf, Fin, Fin, Fin, Fin, RFn, RFn, Fin, F…
## $ Garage_Cars        <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, …
## $ Garage_Area        <dbl> 528, 730, 312, 522, 482, 470, 582, 506, 608, 442, 4…
## $ Garage_Cond        <fct> Typical, Typical, Typical, Typical, Typical, Typica…
## $ Paved_Drive        <fct> Partial_Pavement, Paved, Paved, Paved, Paved, Paved…
## $ Wood_Deck_SF       <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 48…
## $ Open_Porch_SF      <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0…
## $ Enclosed_Porch     <int> 0, 0, 0, 0, 0, 0, 170, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Three_season_porch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Screen_Porch       <int> 0, 120, 0, 0, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 140, …
## $ Pool_Area          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Pool_QC            <fct> No_Pool, No_Pool, No_Pool, No_Pool, No_Pool, No_Poo…
## $ Fence              <fct> No_Fence, Minimum_Privacy, No_Fence, No_Fence, Mini…
## $ Misc_Feature       <fct> None, None, Gar2, None, None, None, None, None, Non…
## $ Misc_Val           <int> 0, 0, 12500, 0, 0, 0, 0, 0, 0, 0, 0, 500, 0, 0, 0, …
## $ Mo_Sold            <int> 5, 6, 6, 4, 3, 6, 4, 1, 3, 6, 4, 3, 5, 2, 6, 6, 6, …
## $ Year_Sold          <int> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 201…
## $ Sale_Type          <fct> WD , WD , WD , WD , WD , WD , WD , WD , WD , WD , W…
## $ Sale_Condition     <fct> Normal, Normal, Normal, Normal, Normal, Normal, Nor…
## $ Sale_Price         <int> 215000, 105000, 172000, 244000, 189900, 195500, 213…
## $ Longitude          <dbl> -93.61975, -93.61976, -93.61939, -93.61732, -93.638…
## $ Latitude           <dbl> 42.05403, 42.05301, 42.05266, 42.05125, 42.06090, 4…
options(scipen = 999)

ames %>% 
  ggplot(aes(x = Sale_Price)) + 
  geom_histogram(bins = 50, color = "white", fill = "lightpink") +
  theme_minimal()

Given the right skew of the data, sale price should be log-transformed:

ames %>% 
  ggplot(aes(x = Sale_Price)) + 
  geom_histogram(bins = 50, color = "white", fill = "lightpink") +
  scale_x_log10() +
  theme_minimal()

ames %>% 
  ggplot(aes(x = Gr_Liv_Area)) + 
  geom_histogram(bins = 50, color = "white", fill = "skyblue") +
  # scale_x_log10() +
  theme_minimal()

ames %>% 
  ggplot(aes(x = Gr_Liv_Area)) + 
  geom_histogram(bins = 50, color = "white", fill = "skyblue") +
  scale_x_log10() +
  theme_minimal()

ames <- 
  ames %>% 
  mutate(Sale_Price = log10(Sale_Price), 
         Gr_Liv_Area = log10(Gr_Liv_Area))

Your turn

Visualize the relationship between sale price and:

  • year built
  • building type
  • living area: Gr_Liv_Area
  • anything else you’re intrigued by

Introduction to tidymodels

  • {tidymodels} is the collection of R packages within the tidyverse that focus on the modeling process. These packages:

  • share an underlying philosophy, grammar, and data structures

  • follow tidyverse principles: design for humans, reuse of existing data structures, and functional programming with the pipe operator %>%

  • have a consistent API

  • are designed to do different things individually, but also to work together

  • Motivation behind the creation of tidymodels:

  • streamline the modeling process from end to end

  • produce high-quality models

  • encourage good methodology and statistical practice

  • Why is it important to be tidy?

  • a user interface that fits all users’ needs (not just the makers’)

  • flexible, and able to evolve over time

Using tidymodels

For our first model, we’ll look into the relationship between sale price and living area.

The first step in a tidymodels workflow is to specify the model (here, linear regression) with parsnip::linear_reg(), then set the mode and computational engine with parsnip::set_mode() and parsnip::set_engine():

set.seed(2021)

library(tidymodels)

lm_spec <- linear_reg() %>%
  set_mode("regression") %>%
  set_engine("lm")

lm_spec
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm

Once we have the specification, we can fit it by supplying a formula expression and the data we want to fit the model on. The formula is written in the form y ~ x, where y is the name of the response and x is the name of the predictor(s). The names used in the formula must match the names of the variables in the data set passed to data.

lm_fit <- 
  lm_spec %>%
  fit(Sale_Price ~ Gr_Liv_Area, data = ames)

lm_fit
## parsnip model object
## 
## Fit time:  1ms 
## 
## Call:
## stats::lm(formula = Sale_Price ~ Gr_Liv_Area, data = data)
## 
## Coefficients:
## (Intercept)  Gr_Liv_Area  
##      2.3583       0.9078
lm_fit %>% 
  pluck("fit") %>% 
  summary()
## 
## Call:
## stats::lm(formula = Sale_Price ~ Gr_Liv_Area, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90237 -0.06361  0.01147  0.07558  0.37360 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  2.35830    0.05057   46.63 <0.0000000000000002 ***
## Gr_Liv_Area  0.90781    0.01602   56.66 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1223 on 2928 degrees of freedom
## Multiple R-squared:  0.523,  Adjusted R-squared:  0.5228 
## F-statistic:  3210 on 1 and 2928 DF,  p-value: < 0.00000000000000022

This output is not tidy, so we’ll use broom::tidy() to get the key parameter information:

tidy(lm_fit)
## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    2.36     0.0506      46.6       0
## 2 Gr_Liv_Area    0.908    0.0160      56.7       0

broom::glance() can be used to extract the model-level statistics:

glance(lm_fit)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
## 1     0.523         0.523 0.122     3210.       0     1  2001. -3996. -3978.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

If you’re satisfied with the model fit, you can use predict() to generate predictions:

predict(lm_fit, new_data = ames)
## # A tibble: 2,930 × 1
##    .pred
##    <dbl>
##  1  5.28
##  2  5.04
##  3  5.19
##  4  5.38
##  5  5.27
##  6  5.27
##  7  5.20
##  8  5.18
##  9  5.27
## 10  5.31
## # … with 2,920 more rows

Setting type = "conf_int" returns a 95% confidence interval for each prediction:

predict(lm_fit, new_data = ames, type = "conf_int")
## # A tibble: 2,930 × 2
##    .pred_lower .pred_upper
##          <dbl>       <dbl>
##  1        5.28        5.29
##  2        5.03        5.05
##  3        5.19        5.20
##  4        5.37        5.38
##  5        5.27        5.28
##  6        5.26        5.27
##  7        5.19        5.20
##  8        5.17        5.18
##  9        5.27        5.28
## 10        5.31        5.32
## # … with 2,920 more rows
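Relatedly, engines that support them can also return prediction intervals via type = "pred_int". Prediction intervals are wider than confidence intervals because they account for observation-level noise, not just uncertainty in the fitted mean. The distinction, sketched with base R’s predict.lm() on toy data:

```r
# Confidence vs. prediction intervals for a simple linear model
set.seed(123)
x <- runif(100, 0, 10)
y <- 1 + 0.5 * x + rnorm(100, sd = 0.3)
fit <- lm(y ~ x)

new  <- data.frame(x = 5)
conf <- predict(fit, new, interval = "confidence")  # uncertainty in the mean
pred <- predict(fit, new, interval = "prediction")  # uncertainty in a new observation

# The prediction interval is always at least as wide
(pred[, "upr"] - pred[, "lwr"]) >= (conf[, "upr"] - conf[, "lwr"])  # TRUE
```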

We can use bind_cols() or tidymodels’ augment() to obtain the observed and predicted sale prices side by side:

pred_obs <- 
  augment(lm_fit, new_data = ames) %>% 
  select(Sale_Price, .pred)

pred_obs
## # A tibble: 2,930 × 2
##    Sale_Price .pred
##         <dbl> <dbl>
##  1       5.33  5.28
##  2       5.02  5.04
##  3       5.24  5.19
##  4       5.39  5.38
##  5       5.28  5.27
##  6       5.29  5.27
##  7       5.33  5.20
##  8       5.28  5.18
##  9       5.37  5.27
## 10       5.28  5.31
## # … with 2,920 more rows

Notice how the output is always a tibble. This makes the results very flexible for further processing and plotting.

ggplot(pred_obs, aes(x = Sale_Price, y = .pred)) + 
  # Create a diagonal line:
  geom_abline(lty = 2) + 
  geom_point(alpha = 0.5) + 
  labs(y = "Predicted Sale Price (log10)", x = "Sale Price (log10)") +
  # Scale and size the x- and y-axis uniformly:
  coord_obs_pred()

We can obtain numerous performance metrics using metric_set():

ames_metrics <- metric_set(rmse, rsq, mae)

ames_metrics(pred_obs, truth = Sale_Price, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      0.122 
## 2 rsq     standard      0.523 
## 3 mae     standard      0.0917
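To demystify what metric_set() computed, the three metrics can be reproduced by hand on a small toy vector (note that yardstick’s rsq is the squared Pearson correlation between truth and estimate):

```r
truth    <- c(5.1, 5.3, 5.0, 5.4)
estimate <- c(5.0, 5.2, 5.1, 5.5)

rmse_manual <- sqrt(mean((truth - estimate)^2))  # root mean squared error
mae_manual  <- mean(abs(truth - estimate))       # mean absolute error
rsq_manual  <- cor(truth, estimate)^2            # squared correlation
```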

Your turn

Build a model using another predictor from the Ames dataset.